This paper reports on work performed during the 1978-1980 SAGE contract
to develop improved national estimates from survey data. Three areas of
effort are covered in this paper. The first is the use of longitudinal
merges combined with relational edits to detect reporting or encoding
errors. The second is the use of longitudinal merges together with
special follow-up surveys to improve the universe coverage; and the third
is the use of missing data imputation techniques to develop national
estimates when key data elements are missing due to nonresponse or
omissions.
RELATIONAL EDITS

The first area of SAGE work to be discussed here was the development
of edit specifications for data from the Common Core of Data (CCD). In
particular, parts VI and VIa of this data base include data on each of the
nation's public school districts (LEAs) and on each public school. While
the number of data elements for each LEA or school is small, the very
large number of units in each file makes it a virtual certainty that data
reporting and/or data entry errors will creep into the file. An important
way in which survey data such as these can be enhanced is to find and
correct such erroneous values.

An efficient edit procedure must identify a high proportion of the
invalid responses while not also flagging so many valid responses as to
make checking each identified case infeasible. In the absence of any
other information, the traditional procedure is to examine the most
extreme values, both because these are least likely to be valid and
because they have the greatest impact on summary statistics. Unfortunately,
the range of valid values for these CCD files is so great that such
an edit would be meaningless. If a district served 800 pupils, but 8,000
was erroneously entered, for example, there would be little chance of
catching this error with a simple range check. The value of 8,000 is
perfectly valid for many districts.

The relational editing strategy proposed by SAGE uses values that are
closely correlated with each field being edited to "predict" the value in
question and then compares the actual values with these predictions. The
greatest discrepancies are flagged for further checking. In the example
cited, the error in the number of students might have been caught because
it led to an unreasonable ratio of students to teachers, or of students
to schools, in the district.
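The ratio check described above can be sketched roughly as follows. This
is only an illustration, not the SAGE edit specification: the function
name, the field names, the cutoff factor of 3, and the district records
are all assumptions made for the example.

```python
from statistics import median

def flag_ratio_outliers(records, numerator, denominator, factor=3.0):
    """Return indices of records whose numerator/denominator ratio is more
    than `factor` times above or below the median ratio (assumed rule)."""
    ratios = [r[numerator] / r[denominator] for r in records]
    typical = median(ratios)
    return [i for i, ratio in enumerate(ratios)
            if ratio > typical * factor or ratio < typical / factor]

# Hypothetical district records; the third repeats the 800-vs-8,000 example.
districts = [
    {"pupils": 800, "teachers": 40},
    {"pupils": 8000, "teachers": 410},   # a genuinely large district
    {"pupils": 8000, "teachers": 40},    # 800 pupils keyed as 8,000
    {"pupils": 1500, "teachers": 70},
    {"pupils": 300, "teachers": 16},
]
flagged = flag_ratio_outliers(districts, "pupils", "teachers")
```

Both districts reporting 8,000 pupils pass a range check, but only the
miskeyed one has a pupil/teacher ratio far from the others.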

By far the best predictor of any of the values in the LEA and Public
School surveys is the corresponding value from the prior year's survey.
Therefore, longitudinal merges were proposed to allow the comparison of
values between successive years. To illustrate the effectiveness of this
approach, some data were taken from NCES's Nonpublic School Surveys.
Figure 1 shows the distribution of the number of pupils served by each
nonpublic school. This distribution is very broad. If we wanted to
examine only schools with the most extreme values, say the upper and lower
1%, we would have to accept all values between about 5 and 1100. Figure 2
shows the distribution of the differences between the 1977-78 values and
the corresponding values from the 1976-77 survey. In this case the range
of values accepted without question would be only about 200 (-100 to 100)
rather than 1100. Most kinds of recording or data entry errors are
relatively infrequent and random, so that the probability that both the
new and the prior values contain compensating errors is negligible. In
this case, virtually all errors of any significant magnitude would be
flagged while few valid responses would be flagged.
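A minimal sketch of such a year-to-year edit is given here, assuming the
two surveys have already been merged on a common school identifier. The
identifiers and enrollment counts are invented; the cutoff of 100 and the
range of 5 to 1100 are the approximate figures quoted in the text.

```python
def flag_by_change(current, prior, max_change):
    """Return ids whose year-to-year change exceeds max_change in absolute
    value. Ids absent from the prior year are skipped here; they would
    need a separate (coverage) check."""
    flagged = []
    for school_id, value in current.items():
        if school_id in prior and abs(value - prior[school_id]) > max_change:
            flagged.append(school_id)
    return sorted(flagged)

# Hypothetical enrollments; school "C" has 45 keyed as 450 in the later year.
enroll_1976 = {"A": 120, "B": 950, "C": 45, "D": 300}
enroll_1977 = {"A": 131, "B": 940, "C": 450, "D": 310, "E": 75}

# A plain range check on the later year accepts every value ...
passes_range_check = all(5 <= v <= 1100 for v in enroll_1977.values())
# ... while the difference check isolates the suspicious record.
flagged = flag_by_change(enroll_1977, enroll_1976, max_change=100)
```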

Figure 2 also shows that the difference values have a nearly normal
distribution, particularly in comparison to the highly skewed distribution
in Figure 1. To the extent that the true values do follow a normal
distribution, we have some basis for estimating the proportion of "error"
values above or below any given cutoff by comparing the actual
distribution with the predicted distribution. Figure 2 shows a normal
distribution superimposed over the actual difference distribution. The
relatively thicker tails of the observed distribution could be due to a
greater proportion of errors among the more extreme differences. This is,
of course, very tentative. It is not necessary to estimate the error rate
ahead of time unless it is desired to perform some form of cost-benefit
analysis to determine an "optimal" cutoff point.

One problem in "fitting" a normal model to estimate the error rate is
that the correct and error values are initially indistinguishable. If the
overall standard deviation is used to estimate the standard deviation of
the correct values, the resultant estimate will be too high by some
unknown amount since error values have an additional variance component.
As a result we will estimate that more of the extreme values are valid
than is actually the case. In a recent study based on SAGE work,
Fingerman (1981) showed that if the standard deviation of the correct
values is estimated from the interquartile distance (actually as .74 times
the distance from the first to the third quartile point), the resultant
estimate is quite accurate, even where the proportion of errors is
relatively large. The interquartile distance is influenced by the number
of extreme cases but not by their degree of extremity, while the usual
variance estimator is strongly influenced by the degree of extremity of
the most deviant cases. In a Monte Carlo simulation Fingerman found that
when the variance of the distribution of "error" cases was nine times the
variance of the distribution of valid responses and 10% of the cases were
in error, the usual variance estimate based on all cases was 2.9 times too
large, but the estimate based on the interquartile distance was only 1.1
times too large. Further, the estimate based on the interquartile
distance was quite stable. The variance of the interquartile "variance"
estimate was only 2% of the actual value compared to 30% for the usual
estimate based on all cases.
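The interquartile estimator can be illustrated with a small simulation.
This is a sketch under assumed conditions (zero-mean normal valid values
with standard deviation 1, and 10% zero-mean normal "error" values with
variance 9); it is not Fingerman's design and is not expected to
reproduce the figures quoted above.

```python
import random
import statistics

def iqr_sd(values):
    """Estimate the SD of the valid values as 0.74 times the distance
    from the first to the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 0.74 * (q3 - q1)

rng = random.Random(0)
# Assumed mixture: 9,000 valid differences (SD 1) and 1,000 errors (SD 3).
sample = ([rng.gauss(0, 1) for _ in range(9000)]
          + [rng.gauss(0, 3) for _ in range(1000)])

usual = statistics.stdev(sample)   # inflated by the error cases
robust = iqr_sd(sample)            # much closer to the valid-value SD of 1
```

Under these assumptions the usual estimate is pulled well above 1 by the
error cases, while the interquartile estimate stays close to it.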




UNIVERSE COVERAGE

For much of the work that NCES does, estimates of totals, such as the
total number of pupils, schools, teachers, and expenditures, are critical.
For this reason, the issue of whether the universe has been fully covered
is of particular concern. (If we were only estimating means, omitting
some schools from the sampling or survey frame might not introduce serious
bias, but if we want to know the total number of students such an omission
will necessarily result in an undercount.)



One area of SAGE effort where the issue of coverage was of critical
concern was in our work with the Nonpublic Elementary and Secondary School
Surveys (McLaughlin & Wise, 1980). We began with a file of just over
18,000 schools from the 1977-78 survey and merged these with a somewhat
smaller number of schools from the 1976-77 survey. (The 1976-77 files did
not include nonrespondents.) The merging process was complicated by the
fact that there was not a common identifier, so that fallible name and
address data had to be used to match schools. The process turned up the
fact that both files contained some duplicate schools with small variations
in the names and/or addresses. More importantly, each file contained a
number of schools that were not in the other file. A sample of these
schools was contacted and it was found that most of them were in fact
operating both years. Other special cases were also identified, such as
the fact that Mormon schools only reported aggregate data for the 1977-78
survey.
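Matching on fallible name and address data can be sketched as below. The
original matching rules are not described in this paper, so the string
similarity measure (Python's difflib), the 0.8 threshold, the greedy
pairing, and the school records are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and replace punctuation with spaces so that trivial
    variations do not block a match."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " "
                   for ch in text.lower())
    return " ".join(kept.split())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_files(current, prior, threshold=0.8):
    """Greedily pair each current-year record with its most similar
    unused prior-year record. Returns (matches, unmatched_current,
    unmatched_prior)."""
    matches, unmatched_current = [], []
    remaining = list(prior)
    for rec in current:
        best = max(remaining, key=lambda p: similarity(rec, p), default=None)
        if best is not None and similarity(rec, best) >= threshold:
            matches.append((rec, best))
            remaining.remove(best)
        else:
            unmatched_current.append(rec)
    return matches, unmatched_current, remaining

# Hypothetical name-and-address strings from two survey years.
current = [
    "St. Mary's School, 12 Oak St, Springfield",
    "Hillside Academy, 400 Ridge Rd, Dayton",
    "New Hope Christian School, 9 Elm Ave, Salem",
]
prior = [
    "Saint Marys School, 12 Oak Street, Springfield",
    "Hillside Academy, 400 Ridge Road, Dayton",
    "Lakeview Day School, 77 Shore Dr, Salem",
]
matches, unmatched_current, unmatched_prior = match_files(current, prior)
```

The unmatched records on each side are the candidates for the kind of
follow-up contact described above.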

In the end, after the addition of the 1978-79 survey data and similar
checking on unmatched schools, the total number of schools identified and
considered open during the 1977-78 school year was estimated to be over
20,000 (20,073) rather than the 18,103 initially identified. Needless to
say, this reflects an increase of over 10% in the estimated number of
nonpublic schools as well as in estimates of the number of students and
teachers in these schools. (Later checks of state directories by SAGE
indicated an additional undercoverage of approximately 10% in schools, or
1 or 2% in enrollment.) A current SAGE effort is designed to test
alternative field strategies for assessing the adequacy of coverage in
universe surveys such as this.

IMPUTATION OF MISSING DATA

The most ambitious SAGE effort in the area of survey data enhancement
concerns imputation of missing data. This effort combined work on NCES's
nonpublic school surveys with a general methodological development task to
study procedures for imputing missing values. Separate procedures were
developed for imputing discrete (nominal) and continuous (interval)
variables with or without prior year's data. Each of the final procedures
was subjected to a special validation study where known values were masked
and run through the imputation procedure. The real and imputed values were
compared to assess the extent of bias in estimates of means, variances, or
relationships generated from the imputed values. The results of this
validation were quite promising. The overall mean bias (due to missing
data) in estimates generated from the final data was estimated to be less
than one-half percent. Variances and relationships (correlations or
conditional frequencies) were also reproduced reasonably well. The results
were far and away superior to the two "easy" options for dealing with
missing data: ignoring it or substituting mean values.

These results apply to the final procedures. During the course of
this work, we learned the hard way about a number of pitfalls in the
application of a regression approach to the imputation of missing values.
These experiences were valuable for our subsequent work on a general
algorithm for the imputation of missing data. That work incorporated
solutions to some of the sticky problems that we encountered, including
the following. These problems illustrate the difficulty of avoiding
serious bias in the values.

Variables with nonnormal distributions. Most of the continuous
variables in this survey had strongly skewed distributions with no negative
values and a small number of very large values. This was particularly
true for the expenditure data. The regression approach occasionally gave
predicted values that were negative. More frequently, when we went to add
a random component reflecting the prediction error (to avoid shrinking the
variance of the imputed values relative to the appropriate level), the
random component caused the imputed value to become negative. In order to
avoid having negative values (e.g., for enrollment) on the file, small
positive values were substituted for the negative values. This, of
course, led to a positive bias so that we had to introduce a corresponding
truncation of relatively large values in order to compensate for the
correction of negative values. This procedure is clearly unacceptable in
general and is a strong rationale for use of some form of "hot deck"
procedure that limits imputed values to the range of actually observed
values instead of a formula procedure such as regression.

Problems with the use of derived variables. In predicting missing
values from prior year's data, we were actually predicting the percent
increase from other variables and then multiplying the prior value by the
predicted rate of increase. Unfortunately, this led to another bias since
the expected value of the product of two random variables (the prior value
times the rate of increase) is greater than the product of their expected
values. Here too a correction was developed that proved satisfactory for
each particular case. Initially, we had an even more severe problem in
that we attempted to predict the log of the expenditure rate rather than
the rate itself. This made sense because the expenditure data showed a
somewhat logarithmic relationship to the potential predictors. It proved
to be a disaster, however, since very small overestimates of the log led
to rather large overestimates of the expenditure rate itself, so that when
we converted back to real dollars, we had serious overestimates!
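The effect of converting back from the log scale can be seen in a short
arithmetic sketch. The expenditure rate and the prediction errors below
are invented for illustration; the errors are symmetric around zero on
the log scale, yet the back-transformed values are too high on average.

```python
import math

true_rate = 1000.0                            # hypothetical dollars per pupil
log_errors = [-0.5, -0.25, 0.0, 0.25, 0.5]    # mean zero on the log scale

predicted_logs = [math.log(true_rate) + e for e in log_errors]
back_transformed = [math.exp(p) for p in predicted_logs]

mean_log_error = sum(log_errors) / len(log_errors)
mean_rate = sum(back_transformed) / len(back_transformed)
relative_bias = mean_rate / true_rate - 1.0   # positive despite unbiased logs

# A single overestimate of 0.5 in the log inflates the rate by about 65%.
inflation_from_half_log = math.exp(0.5) - 1.0
```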

Preserving relationships among imputed values. A third sticky problem
that surfaced was the difficulty of preserving true relationships
among imputed values. For many nonresponding schools, very little was
known, so that most of the values were imputed. If each missing value was
imputed independently from the available values, relationships between the
missing values would have been missed. For example, we imputed whether
the school served boys, or girls, or both and whether the school included
boarding students separately from the school's religious affiliation. Table
1 shows data from the validation study comparing the actual and imputed
values. The actual values indicate that schools that served girls only
were much less likely to include boarding students relative to other
schools. This relationship was not found among the imputed values.

After having spent months developing tailor-made procedures for
imputing missing values in the nonpublic school surveys, we sought to
create an algorithm that would allow researchers to perform the equivalent
work in an afternoon. The result of this effort was PROC IMPUTE (Wise &
McLaughlin, 1980), a new procedure added to the Statistical Analysis
System (SAS). By incorporating our algorithm into an existing statistical
package, we eliminated the need for a researcher to duplicate efforts
already spent defining variables, labels, missing data codes, etc. We
also made the procedure more powerful in that it could be combined with
the great flexibility already available in the SAS system for taking
samples of cases, recoding variables, merging in additional data, and
saving intermediate files.

Table 1

Actual and Imputed Relationship
between Sex Served and Boarding Facilities

The basic approach used in PROC IMPUTE is that a regression equation
is developed for each variable with any missing values. For each
equation, a two-way table giving the frequency of the actual values by the
predicted regression function values (divided into discrete categories)
is developed. Figure 3 illustrates such a contingency table. For each
missing value, a "predicted" value is generated using the regression
function, and then an "actual" value is selected randomly with probability
proportional to the frequencies in the row of the two-way table
corresponding to the predicted value. By using this procedure instead of
just using the predicted values, we are certain that only values that
actually occur are selected as imputed values, and we assure an
appropriate variation for the imputed values.
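The idea of drawing an observed value from the row that matches the
predicted category can be sketched as follows. This is not the PROC
IMPUTE code: it assumes a single predictor, an ordinary least squares
line, three equal-width categories, and invented data, and the function
names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def impute(xs, ys, n_bins=3, seed=0):
    """Fill None entries of ys by drawing an observed y from the cases
    whose predicted value falls in the same category as the missing case."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    preds = [a + b * x for x in xs]
    lo, hi = min(preds), max(preds)
    width = (hi - lo) / n_bins or 1.0   # guard against a flat regression

    def category(p):
        return min(int((p - lo) / width), n_bins - 1)

    # One row per predicted category. Each observed value is listed once
    # per occurrence, so a uniform draw from the row is proportional to
    # the row frequencies of the two-way table.
    rows = {}
    for p, y in zip(preds, ys):
        if y is not None:
            rows.setdefault(category(p), []).append(y)

    all_observed = [y for y in ys if y is not None]   # fallback for empty rows
    return [y if y is not None
            else rng.choice(rows.get(category(p), all_observed))
            for p, y in zip(preds, ys)]

# Invented data: three values of y are missing.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [10, 12, None, 40, 45, None, 90, 95, None]
filled = impute(xs, ys)
```

Every imputed value is one that actually occurs among cases with a
similar predicted value, so negative or otherwise impossible values
cannot be produced.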

One other feature of PROC IMPUTE is that the regression equations are
developed in a "stepwise" manner. The first variable is imputed only from
variables with no missing values. Each succeeding variable includes the
variables already imputed as potential predictors so that imputed values
are used in imputing other missing values. This is a significant
difference from the BMDP procedure where only nonmissing values are used as
predictors. After each missing value has been imputed, the procedure
generates a second equation for "reimputing" each variable with missing
values from all other variables. In practice, this second imputation is
performed only if variables that were excluded in the initial imputation
(because they came later in the initial list) had a significant
correlation with the variable being imputed after partialling out the
predictors that were used. In this way any significant relationships
between variables with missing values are preserved, since each is used in
the prediction of the other. A special procedure was developed to select
an optimal ordering of the variables for the initial imputations. This
procedure performs a "simultaneous" stepwise regression for all variables
with missing values. At each step, a target variable and a new predictor
variable are chosen that maximally reduce the uncertainty in the remaining
missing values subject to the constraints imposed by the existing partial
ordering of the variables. The pair selected then adds a new order
constraint, that the predictor must precede the target variable in the
imputation list. The process is continued until no more significant
predictors are available.

Table 2 shows some results of a Monte Carlo study comparing the
results of PROC IMPUTE to the results of the BMDP procedure. The results
show that PROC IMPUTE was indeed successful at reproducing variances and
correlations while not sacrificing much in the accuracy of mean
predictions. Copies of this procedure and instructions for setting up an
appropriate SAS library can be obtained from AIR at cost.

FIGURE 3. Distribution of target variable for regression-function subset

SCHOOL DISTRICT SES MEASURE
(School TV Utilization Study: NCES, 1979)

Note. The regression function was selected to account for maximum
variance in the SES measure. Values were then partitioned into
9 discrete categories. The "n" refers to the number of cases in
each regression-function category.

                                 Table 2

                Processing Time and Accuracy of Different
                     BMDPAM Options and PROC IMPUTE

                                    Error of     Error of     Error of
                       Processing   Mean         S.D.         Correlation
                       Time*        Estimate**   Estimate**   Estimate***

BMDPAM Options:
  Mean Substitution       3.8         .549                       .33
  Single Variable         5.6         .403         .558
  Two Step                6.8         .400         .527
  Total Regression       11.8         .392         .501
  Stepwise Regression    25.6         .390         .508          .21
PROC IMPUTE               8.2         .383         .105          .15

*   For a file with 20 variables and 1,000 observations. The processing
    time is in CPU seconds for an IBM 370/168 running under MVS.

**  Average absolute error across 20 variables expressed in standard
    deviation units.

*** Root mean square errors averaged across all pairs of variables and all
    replications.

SUMMARY

During the past two years SAGE has worked on the enhancement of
survey data as one of its main themes. The work described here concerns
the use of longitudinal merges for enhancing edits and for improving
universe coverage, and the development of missing data imputation
procedures that can be applied to a wide range of surveys. The current
SAGE team is continuing work in the area of survey data enhancement,
including the development of survey error profiles and the study of
appropriate analytic techniques.